STATS 32 Session 3: Data Visualization

Kenneth Tay

Oct 1, 2019

http://web.stanford.edu/~kjytay/courses/stats32-aut2019/

Recap of week 1

Vectors

vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"
vec[c(2, 4)]  # index 4 is past the end of vec, so R returns NA there
## [1] "b" NA

Lists

classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)
classes$ID
## [1] "STATS 32"  "STATS 101" "STATS 200"
classes[["credits"]]
## [1] 12
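
As a quick sketch of the difference between the list accessors, the code below rebuilds the `classes` list from this slide and compares `$`, `[[ ]]` and `[ ]`. The variable names `id_vec`, `n_credits` and `sub_list` are made up for this illustration.

```r
# Recreate the list from the slide above
classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)

# $ and [[ ]] both return the element itself
id_vec <- classes$ID
n_credits <- classes[["credits"]]

# Single brackets return a shorter list, not the element inside it
sub_list <- classes["credits"]

class(id_vec)     # a character vector
class(sub_list)   # still a list
```

A rule of thumb: use `[[ ]]` or `$` when you want the contents, and `[ ]` when you want a smaller list.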

Data frames

A special type of list: each element is a column, and all columns have the same length.

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
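
Because a data frame is a list of columns, the list tools from the previous slide should carry over. A minimal check on `mtcars` (the names `mpg_dollar` and `mpg_bracket` are just for this sketch):

```r
data(mtcars)

# A data frame is both a list and a data.frame
is.list(mtcars)
is.data.frame(mtcars)

# List-style access pulls out a column as a vector
mpg_dollar <- mtcars$mpg
mpg_bracket <- mtcars[["mpg"]]
identical(mpg_dollar, mpg_bracket)

# length() counts list elements (columns); nrow() counts rows
length(mtcars)
nrow(mtcars)
```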

Getting a feel for your data

Agenda for today

Words vs. pictures

“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey

##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6
## 11 17.8  3.440         6
## 12 16.4  4.070         8
## 13 17.3  3.730         8
## 14 15.2  3.780         8
## 15 10.4  5.250         8
## 16 10.4  5.424         8
## 17 14.7  5.345         8
## 18 32.4  2.200         4
## 19 30.4  1.615         4
## 20 33.9  1.835         4
## 21 21.5  2.465         4
## 22 15.5  3.520         8
## 23 15.2  3.435         8
## 24 13.3  3.840         8
## 25 19.2  3.845         8
## 26 27.3  1.935         4
## 27 26.0  2.140         4
## 28 30.4  1.513         4
## 29 15.8  3.170         8
## 30 19.7  2.770         6
## 31 15.0  3.570         8
## 32 21.4  2.780         4
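
The slides do not show how this table was built. One plausible reconstruction from `mtcars` is sketched below; the name `df` matches the later ggplot2 slides, but the renaming and the conversion of `cylinders` to a factor are assumptions, not something the slides confirm.

```r
data(mtcars)

# Hypothetical reconstruction of the table above (column names taken from its header)
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,               # wt is recorded in 1000 lbs
                 cylinders = factor(mtcars$cyl))   # assumed: treated as categorical

head(df, 10)
```

Storing `cylinders` as a factor tells plotting functions to treat it as a categorical variable rather than a number.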

Two classes of variables in statistics

Barplots: counts for a categorical variable

What is the distribution of cylinders in my dataset?
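
A minimal sketch of how such a barplot could be drawn with ggplot2, assuming the package is installed and using `mtcars` directly (the original plot is not preserved, so this is not necessarily the code behind it):

```r
library(ggplot2)
data(mtcars)

# The counts that the bars will show
cyl_counts <- table(mtcars$cyl)
cyl_counts

# geom_bar() counts the rows in each category for us
p_bar <- ggplot() +
    geom_bar(data = mtcars, mapping = aes(x = factor(cyl)))
p_bar
```

Wrapping `cyl` in `factor()` makes ggplot2 treat it as categorical, so only the values 4, 6 and 8 appear on the x-axis.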

Histograms: counts for a continuous variable

What is the distribution of miles per gallon in my dataset?

Scatterplots: continuous variable vs. continuous variable

What is the relationship between mpg and weight?

Lineplots: continuous variable vs. time variable

What is the relationship between mpg and time?

Not so good…

Easier to see the trend

Boxplots & violin plots: continuous variable vs. categorical variable

For each value of cylinder, what is the distribution of mpg like?
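
The boxplot version appears in the ggplot2 code later in these slides. As a hedged companion, here is a violin plot sketch using `mtcars` directly, together with a per-group summary so the picture can be checked against numbers. The names `mpg_by_cyl` and `p_violin` are made up for this example.

```r
library(ggplot2)
data(mtcars)

# Median mpg within each cylinder group
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
mpg_by_cyl

# One violin per cylinder value, showing the shape of the mpg distribution
p_violin <- ggplot() +
    geom_violin(data = mtcars, mapping = aes(x = factor(cyl), y = mpg))
p_violin
```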

Heatmaps: categorical variable vs. categorical variable

How often does each pair of cylinder and gear occur in the dataset?
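
One way to build such a heatmap, sketched under the assumption that we first tabulate the pairs ourselves and then draw one tile per pair with `geom_tile()`. The slides do not say which geom was used for the original figure.

```r
library(ggplot2)
data(mtcars)

# Count how often each (cylinders, gear) pair occurs
pair_counts <- as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))
pair_counts

# One tile per pair, shaded by its count
p_heat <- ggplot() +
    geom_tile(data = pair_counts,
              mapping = aes(x = cyl, y = gear, fill = Freq))
p_heat
```

`as.data.frame()` on a table gives one row per pair, with the count in a column named `Freq`, which is the long format that ggplot2 expects.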

Summary

Case study

I have father-son pairs. For each pair, I record their heights and weights, as well as their ethnicities. I want to study the relationship between the characteristics of the father and those of the son. What plots could help me?

Data visualization in R: 2 broad approaches

base R

ggplot2

How can we describe a graphic?

Hadley Wickham

3 essential elements of graphics: data, geometries, aesthetics

Data: Dataset we are using for the plot

##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6

3 essential elements of graphics: data, geometries, aesthetics

Geometries: Visual elements used for our data

Geom: point

3 essential elements of graphics: data, geometries, aesthetics

Aesthetics: Mappings from data columns to visual properties of the geom (e.g. position, color, shape)

3 different aesthetics:

Examples of other aesthetics

ggplot2 code

ggplot()

ggplot2 code

ggplot() +
    geom_histogram(data = df, mapping = aes(x = mpg))

ggplot2 code

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg))

ggplot2 code

ggplot() +
    geom_point(data = df, 
               mapping = aes(x = weight, y = mpg, col = cylinders),
               shape = 15)
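
In the code above, `col = cylinders` sits inside `aes()` while `shape = 15` sits outside it. The sketch below contrasts the two: inside `aes()` a property is mapped to a data column, outside it is set to one fixed value for every point. Here `df` is an assumed stand-in for the slides' data frame, rebuilt from `mtcars`.

```r
library(ggplot2)
data(mtcars)

# Assumed stand-in for the df used on the slides
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Mapped: color varies with a data column, so it goes inside aes()
p_mapped <- ggplot() +
    geom_point(data = df,
               mapping = aes(x = weight, y = mpg, col = cylinders))

# Set: every point gets the same color and shape, so these go outside aes()
p_set <- ggplot() +
    geom_point(data = df,
               mapping = aes(x = weight, y = mpg),
               colour = "blue", shape = 15)

p_mapped
p_set
```

Only the mapped version gets a legend, since only there does color carry information about the data.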

Today’s dataset: World Bank data

(Source: flickr and World Bank)

DataBank homepage

Interface for World Development Indicators

Optional material

Full specification of a graphic

One graphic contains:

Other grammatical elements: position

Sometimes we need to tweak the position of the geometric elements because they obscure each other.

Only 9 data points??

Much better
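
The plots for this slide are not preserved, so the variables below are an assumption: `cyl` against `gear` in `mtcars` is a pair where many rows share exactly the same coordinates. The sketch shows the overlap problem and one common fix, `position = "jitter"`.

```r
library(ggplot2)
data(mtcars)

# Many rows share the same (cyl, gear) pair, so their points coincide
n_rows <- nrow(mtcars)
n_distinct <- nrow(unique(mtcars[, c("cyl", "gear")]))
c(rows = n_rows, distinct_pairs = n_distinct)

# Default position: coinciding points are drawn on top of each other
p_overlap <- ggplot() +
    geom_point(data = mtcars, mapping = aes(x = cyl, y = gear))

# position = "jitter" adds a little random noise to each point's location
p_jitter <- ggplot() +
    geom_point(data = mtcars, mapping = aes(x = cyl, y = gear),
               position = "jitter")

p_overlap
p_jitter
```

Jittering changes where points are drawn, not the data, so it is best kept for discrete or heavily rounded variables.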

Scales example: colors

Default colors

Manually chosen colors
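
A sketch of how the default color scale could be swapped for manually chosen colors with `scale_color_manual()`. The data frame `df` is an assumed reconstruction from `mtcars`, and the three colors are picked arbitrarily for illustration; they are not the ones on the original slide.

```r
library(ggplot2)
data(mtcars)

# Assumed stand-in for the df used on the slides
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Default colors: ggplot2 picks a color for each level of cylinders
p_default <- ggplot() +
    geom_point(data = df,
               mapping = aes(x = weight, y = mpg, col = cylinders))

# Manually chosen colors: a named vector with one entry per level
my_colors <- c("4" = "darkgreen", "6" = "orange", "8" = "purple")
p_manual <- p_default +
    scale_color_manual(values = my_colors)

p_default
p_manual
```

Naming the vector by factor level keeps each color tied to the same group even if the level order changes.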

Shapes in R

Colors in R